
    Differentially Private Mixture of Generative Neural Networks

    Generative models are used in a wide range of applications building on large amounts of contextually rich information. However, due to possible privacy violations of the individuals whose data is used to train these models, publishing or sharing generative models is not always viable. In this paper, we present a novel technique for privately releasing generative models and entire high-dimensional datasets produced by these models. We model the generator distribution of the training data with a mixture of k generative neural networks. These are trained together and collectively learn the generator distribution of a dataset. Data is divided into k clusters using a novel differentially private kernel k-means, and each cluster is then given to a separate generative neural network, such as a Restricted Boltzmann Machine or a Variational Autoencoder, which is trained only on its own cluster using differentially private gradient descent. We evaluate our approach on the MNIST dataset, as well as on call detail record and transit datasets, showing that it produces realistic synthetic samples, which can also be used to accurately answer an arbitrary number of counting queries. Comment: A shorter version of this paper appeared at the 17th IEEE International Conference on Data Mining (ICDM 2017). This is the full version, published in IEEE Transactions on Knowledge and Data Engineering (TKDE).
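
    As a concrete illustration of the training step this abstract mentions, the following is a minimal sketch of differentially private gradient descent in the common clip-and-noise form (per-example gradient clipping followed by Gaussian noise). The function name, the parameters clip_norm and noise_multiplier, and the exact noise calibration are illustrative assumptions, not the paper's precise mechanism.

        import numpy as np

        def dp_gradient_step(per_example_grads, clip_norm, noise_multiplier, lr, params, rng):
            """One DP-SGD-style update: clip each example's gradient, average, add noise."""
            clipped = []
            for g in per_example_grads:
                norm = np.linalg.norm(g)
                # Clipping bounds each individual's influence on the update.
                clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
            mean_grad = np.mean(clipped, axis=0)
            # Gaussian noise is scaled to the clipping bound (assumed calibration).
            noise = rng.normal(0.0, noise_multiplier * clip_norm / len(clipped),
                               size=mean_grad.shape)
            return params - lr * (mean_grad + noise)

        rng = np.random.default_rng(0)
        grads = [rng.normal(size=10) for _ in range(32)]  # toy per-example gradients
        params = dp_gradient_step(grads, clip_norm=1.0, noise_multiplier=1.1,
                                  lr=0.1, params=np.zeros(10), rng=rng)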

    A Case Study: Privacy Preserving Release of Spatio-temporal Density in Paris

    With billions of handsets in use worldwide, the quantity of mobility data is enormous. When aggregated, such data can help us understand complex processes, such as the spread of viruses, build better transportation systems, and prevent traffic congestion. While the benefits provided by these datasets are indisputable, they unfortunately pose a considerable threat to location privacy. In this paper, we present a new anonymization scheme to release the spatio-temporal density of Paris, France, i.e., the number of individuals in 989 different areas of the city, released every hour over a whole week. The density is computed from a call detail record (CDR) dataset, provided by the French telecom operator Orange, containing the CDRs of roughly 2 million users over one week. Our scheme is differentially private, and hence provides a provable privacy guarantee to each individual in the dataset. Our main goal with this case study is to show that, even with high-dimensional sensitive data, differential privacy can provide practical utility with a meaningful privacy guarantee if the anonymization scheme is carefully designed. This work is part of the national project XData (http://xdata.fr), which aims at combining large (anonymized) datasets provided by different service providers (telecom, electricity, water management, postal service, etc.).
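
    For intuition, the sketch below shows the Laplace mechanism that density-release schemes of this kind typically build on: each area-hour count is perturbed with Laplace noise calibrated to a sensitivity bound. The sensitivity of 1 (one count affected per individual) and the single epsilon budget are simplifying assumptions of this sketch; the paper's actual pipeline is more elaborate.

        import numpy as np

        def noisy_density(counts, epsilon, sensitivity=1.0, rng=None):
            """Add Laplace(sensitivity/epsilon) noise to each area-hour count."""
            rng = rng or np.random.default_rng()
            noisy = counts + rng.laplace(0.0, sensitivity / epsilon, size=counts.shape)
            # Counts cannot be negative, so round and clip for the release.
            return np.clip(np.rint(noisy), 0, None)

        # e.g. 989 areas observed over 168 hours (one week), with toy counts:
        counts = np.random.default_rng(0).poisson(50.0, size=(989, 168))
        released = noisy_density(counts, epsilon=1.0)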

    Differentially Private Sequential Data Publication via Variable-Length N-Grams

    Sequential data is being increasingly used in a variety of applications, and publishing sequential data is of vital importance to the advancement of these applications. However, as shown by the re-identification attacks on the AOL and Netflix datasets, releasing sequential data may pose considerable threats to individual privacy. Recent research has indicated the failure of existing sanitization techniques to provide their claimed privacy guarantees. It is therefore urgent to respond to this failure by developing new schemes with provable privacy guarantees. Differential privacy is one of the few models that can be used to provide such guarantees. Due to its inherent sequentiality and high dimensionality, it is challenging to apply differential privacy to sequential data. In this paper, we address this challenge by employing a variable-length n-gram model, which extracts the essential information of a sequential database in terms of a set of variable-length n-grams. Our approach makes use of a carefully designed exploration tree structure and a set of novel techniques based on the Markov assumption in order to lower the magnitude of added noise. The published n-grams are useful for many purposes. Furthermore, we develop a solution for generating a synthetic database, which enables a wider spectrum of data analysis tasks. Extensive experiments on real-life datasets demonstrate that our approach substantially outperforms state-of-the-art techniques.
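
    A toy version of the core idea, publishing Laplace-noised n-gram counts, might look as follows. It omits the paper's adaptive exploration tree and Markov-based noise reduction; the truncation bound l_max, used here to cap how many counts one individual can affect, is an assumption of this sketch.

        from collections import Counter
        import numpy as np

        def noisy_ngram_counts(sequences, n, epsilon, l_max, rng=None):
            """Count all n-grams and release Laplace-noised counts.

            Each sequence is truncated to l_max items, so one individual
            affects at most (l_max - n + 1) counts; this bounds the sensitivity."""
            rng = rng or np.random.default_rng()
            counts = Counter()
            for seq in sequences:
                seq = seq[:l_max]
                for i in range(len(seq) - n + 1):
                    counts[tuple(seq[i:i + n])] += 1
            sensitivity = l_max - n + 1
            return {g: c + rng.laplace(0.0, sensitivity / epsilon)
                    for g, c in counts.items()}

        # e.g. noisy bigram counts over toy location traces:
        traces = [["a", "b", "c"], ["a", "b"], ["b", "c", "a"]]
        release = noisy_ngram_counts(traces, n=2, epsilon=1.0, l_max=5)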

    Probabilistic k^m-anonymity: Efficient Anonymization of Large Set-Valued Datasets

    A set-valued dataset contains multiple items/values per individual, for example, visited locations, purchased goods, watched movies, or search queries. As it is relatively easy to re-identify individuals in such datasets, their release poses significant privacy threats. Hence, organizations aiming to share such datasets must adhere to personal data regulations. In order to fall outside the scope of these regulations, and also to benefit from sharing, these datasets should be anonymized before their release. In this paper, we revisit the problem of anonymizing set-valued data. We argue that anonymization techniques targeting the traditional k^m-anonymity model, which limits the adversarial background knowledge to at most m items per individual, are impractical for large real-world datasets. Hence, we propose a probabilistic relaxation of k^m-anonymity and present an anonymization technique to achieve it. This relaxation also improves the utility of the anonymized data. We demonstrate the effectiveness of our scalable anonymization technique on a real-world location dataset consisting of more than 4 million subscribers of a large European telecom operator. We believe that our technique can be very appealing for practitioners wishing to share such large datasets.
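
    For reference, the brute-force check below tests the plain (non-probabilistic) k^m-anonymity condition that the paper relaxes: every combination of at most m items occurring in the data must be supported by at least k records. The function name and the representation of records as item sets are illustrative, not the paper's.

        from itertools import combinations
        from collections import Counter

        def is_km_anonymous(records, k, m):
            """records: iterable of item sets (e.g. locations visited per user)."""
            support = Counter()
            for rec in records:
                for size in range(1, m + 1):
                    for combo in combinations(sorted(rec), size):
                        support[combo] += 1
            # An adversary knowing up to m items of a record must always
            # match at least k individuals.
            return all(c >= k for c in support.values())

        # e.g. is_km_anonymous([{"a","b"}, {"a","b"}, {"a","c"}], k=2, m=2)
        # returns False, since the item "c" appears in only one record.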

    Privacy-Preserving and Bandwidth-Efficient Federated Learning: An Application to In-Hospital Mortality Prediction

    Machine Learning, and in particular Federated Machine Learning, opens new perspectives for medical research and patient care. Although Federated Machine Learning improves over centralized Machine Learning in terms of privacy, it does not provide provable privacy guarantees. Furthermore, Federated Machine Learning is quite expensive in terms of bandwidth consumption, as it requires participating nodes to regularly exchange large updates. This paper proposes a bandwidth-efficient, privacy-preserving Federated Learning scheme that provides theoretical privacy guarantees based on Differential Privacy. We experimentally evaluate our proposal for in-hospital mortality prediction using a real dataset containing the Electronic Health Records of about one million patients. Our results suggest that strong and provable patient-level privacy can be enforced at the expense of only a moderate loss of prediction accuracy.
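
    The sketch below combines the two ingredients the abstract names, in one plausible (assumed) form: a differentially private client update via clipping and Gaussian noise, and bandwidth reduction via top-s sparsification, so each node transmits only the s largest coordinates of its update. The names and parameters are assumptions, not the paper's exact protocol.

        import numpy as np

        def private_compressed_update(local_update, clip_norm, noise_multiplier, s, rng):
            """Clip the update, add Gaussian noise, keep only the s largest entries."""
            norm = np.linalg.norm(local_update)
            update = local_update * min(1.0, clip_norm / (norm + 1e-12))
            # Noise is added before sparsification, so the sent values stay private.
            update = update + rng.normal(0.0, noise_multiplier * clip_norm,
                                         size=update.shape)
            idx = np.argsort(np.abs(update))[-s:]  # indices of the top-s magnitudes
            return idx, update[idx]                # only 2*s numbers are transmitted

        rng = np.random.default_rng(0)
        idx, vals = private_compressed_update(rng.normal(size=1000),
                                              clip_norm=1.0, noise_multiplier=1.0,
                                              s=50, rng=rng)
        # Server side: scatter the sparse values into a dense vector and average
        # the contributions across clients.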

    First report of maize redness disease in Hungary

    During 2010, several maize production areas in Hungary were surveyed for the occurrence of maize redness (MR) disease symptoms associated with stolbur phytoplasma, as well as for the presence of the known vector of the disease, the planthopper Reptalus panzeri (Löw). The incidence of maize plants with symptoms of reddening was low in all surveyed areas. Altogether, 25 symptomatic maize plants were collected at 9 localities and tested for phytoplasma presence. In addition, specimens of the cixiids R. panzeri and Hyalesthes obsoletus Signoret were collected at one locality and analyzed by PCR. The presence of stolbur phytoplasma in MR-symptomatic maize plants and in stolbur-infected R. panzeri was identified at a single locality, Monorierdő, in central Hungary. This finding represents the first report of MR presence in Hungary.